Carbon


IntlTokenize

Header: Script.h

Carbon status: Supported

Allows your application to convert text into a sequence of language-independent tokens.

TokenResults IntlTokenize (
    TokenBlockPtr tokenParam
);

Parameter descriptions
tokenParam

A pointer to a token block structure, TokenBlock. The structure specifies the text to be converted to tokens, the destination of the token list, a handle to the tokens ('itl4') resource, and a set of options.

function result

A result code of type TokenResults that specifies the type of error that occurred, if any.

The tokens themselves are returned through the token block structure, in the token list whose location you supply. The token list is an array of token structures (type TokenRec) that correspond to the input text. Each token structure describes the token generated, specifies the part of the source text it came from, and optionally provides a character string that is a normalized version of the text that generated the token.

DISCUSSION

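The token block structure is a parameter block. Its relevant fields are described in the paragraphs that follow.

Before calling the IntlTokenize function, allocate memory for and set up the token block structure, the token list that receives the results, and, if you want strings generated, the string list.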

IntlTokenize creates tokens based on information in the tokens ('itl4') resource of the script system under which the source text was created. You must load the tokens resource and place its handle in the token block structure before calling the IntlTokenize function.

The token block structure contains both input and output values. At input, you must provide values for the fields that specify the source text location, the token list location, the size of the token list, the tokens ('itl4') resource to use, and several options that affect the operation. You must set reserved locations to 0 before calling IntlTokenize.
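The input setup described above can be sketched as follows. This is an illustration only: DemoTokenBlock and DemoTokenRec are stand-in types written for this example, and demo_init_token_block is a hypothetical helper. The field names tokenList, tokenLength, stringList, stringLength, itlResource, and reserved, and all TokenRec field names, are assumptions; the real declarations are in Script.h and may differ in layout and types.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Stand-in types for illustration only; the real TokenBlock and TokenRec
   are declared in Script.h. */
typedef struct DemoTokenRec {
    short theToken;           /* assumed field: kind of token generated */
    const char *position;     /* assumed field: start of the token in the source */
    long length;              /* assumed field: length of the token in the source */
} DemoTokenRec;

typedef struct DemoTokenBlock {
    const char *source;       /* input: source text */
    long sourceLength;        /* input: length of the source text, in bytes */
    DemoTokenRec *tokenList;  /* input: destination of the token list */
    long tokenLength;         /* input: capacity of the token list, in tokens */
    long tokenCount;          /* in/out: number of tokens generated */
    char *stringList;         /* input: destination of the string list */
    long stringLength;        /* input: capacity of the string list, in bytes */
    long stringCount;         /* in/out: bytes of the string list in use */
    int doString;             /* option: generate normalized strings */
    int doAppend;             /* option: append to earlier results */
    void *itlResource;        /* input: handle to the tokens ('itl4') resource */
    long reserved[8];         /* must be 0 before the call */
} DemoTokenBlock;

/* Hypothetical helper: fills in the input fields for a first call that
   generates tokens but no strings. */
void demo_init_token_block(DemoTokenBlock *tb,
                           const char *text, long textLength,
                           DemoTokenRec *tokens, long maxTokens,
                           void *itl4Handle)
{
    assert(tb != NULL && tokens != NULL);

    memset(tb, 0, sizeof *tb);   /* zeroes the reserved fields, counts, and options */
    tb->source = text;
    tb->sourceLength = textLength;
    tb->tokenList = tokens;
    tb->tokenLength = maxTokens;
    tb->stringList = NULL;       /* no string list because doString stays 0 */
    tb->itlResource = itl4Handle;
}
```

Clearing the whole structure first is one simple way to guarantee that the reserved locations are 0; setting each reserved element individually would work equally well.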

On output, the token block structure specifies how many tokens have been generated and the size of the string list (if you have selected the option to generate strings).

The results of the tokenizing operation are contained in the token list, an array of token structures (data type TokenRec).

Pascal strings are generated if the doString parameter in the token block structure is set to TRUE. The string is a normalized version of the source text that generated the token; alternate digits are replaced with ASCII numerals, the decimal point is always an ASCII period, and 2-byte Roman letters are replaced with low-ASCII equivalents.

To make a series of calls to IntlTokenize and append the results of each call to the results of previous calls, set doAppend to FALSE and initialize tokenCount and stringCount to 0 before making the first call to IntlTokenize. (You can ignore stringCount if you set doString to FALSE.) Upon completion of the call, tokenCount and stringCount will contain the number of tokens and the length in bytes of the string list, respectively, generated by the call. On subsequent calls, set doAppend to TRUE, reset the source and sourceLength parameters (and any other parameters as appropriate) for the new source text, but maintain the output values for tokenCount and stringCount from each call as input values to the next call. At the end of your sequence of calls, the token list and string list will contain, in order, all the tokens and strings generated from the calls to IntlTokenize.
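The append sequence described above can be sketched as follows. The sketch does not call IntlTokenize: demo_tokenize is a hypothetical stand-in that merely splits on spaces, and DemoBlock and DemoToken are simplified stand-in types, so that the handling of doAppend and tokenCount can be followed in isolation. How the stand-in treats doAppend is an assumption made for this example.

```c
#include <stddef.h>
#include <string.h>

typedef struct DemoToken {
    const char *position;   /* start of the token in its source text */
    long length;            /* length of the token in bytes */
} DemoToken;

typedef struct DemoBlock {
    const char *source;
    long sourceLength;
    DemoToken *tokenList;
    long tokenLength;       /* capacity of tokenList, in tokens */
    long tokenCount;        /* input when appending, output after the call */
    int doAppend;
} DemoBlock;

/* Stand-in for IntlTokenize: splits the source on spaces.
   Returns 0 on success, 1 if the token list is full. */
int demo_tokenize(DemoBlock *tb)
{
    long out = tb->doAppend ? tb->tokenCount : 0;   /* where new tokens go */
    long i = 0;

    while (i < tb->sourceLength) {
        while (i < tb->sourceLength && tb->source[i] == ' ') i++;
        long start = i;
        while (i < tb->sourceLength && tb->source[i] != ' ') i++;
        if (i > start) {
            if (out >= tb->tokenLength) return 1;
            tb->tokenList[out].position = tb->source + start;
            tb->tokenList[out].length = i - start;
            out++;
        }
    }
    tb->tokenCount = out;
    return 0;
}

/* Tokenizes several pieces of text into one token list. */
int demo_tokenize_all(DemoBlock *tb, const char *const *pieces, int pieceCount)
{
    tb->doAppend = 0;       /* first call: do not append */
    tb->tokenCount = 0;     /* and start from an empty list */

    for (int k = 0; k < pieceCount; k++) {
        tb->source = pieces[k];                      /* new source text */
        tb->sourceLength = (long)strlen(pieces[k]);
        int err = demo_tokenize(tb);
        if (err != 0) return err;
        tb->doAppend = 1;   /* later calls append; tokenCount is left untouched */
    }
    return 0;
}
```

The point of the loop is that the caller never resets tokenCount between calls: the output of one call is the input of the next. With doString set, stringCount would be carried over in the same way.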

If you are making tokens from text that was created under more than one script system, you must load the proper tokens resource and place its handle in the token block structure separately for each script run in the text, appending the results each time.

Delimiters for quoted literals are passed to IntlTokenize in a two-integer array.

The individual delimiters, as specified in the leftDelims and rightDelims parameters, are paired by position. The first (in storage order) opening delimiter in leftDelims is paired with the first closing delimiter in rightDelims.
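Pairing by position can be illustrated with a small lookup. This is a sketch under stated assumptions: DemoTokenType and DemoDelimType are stand-in types, the token codes are placeholder values chosen for the example (the real constants are defined in Script.h), and demo_closing_for is a hypothetical helper, not part of the Script Manager.

```c
#include <stddef.h>

typedef short DemoTokenType;
typedef DemoTokenType DemoDelimType[2];   /* stand-in for a two-element delimiter array */

/* Placeholder token codes for illustration only. */
enum {
    demoLeftQuote = 100,
    demoRightQuote = 101,
    demoLeftSingleQuote = 102,
    demoRightSingleQuote = 103
};

/* Finds the closing delimiter that is paired with `opening`.
   Returns 1 and stores the result in *closing if `opening` appears in
   `left`; returns 0 otherwise. */
int demo_closing_for(const DemoDelimType left, const DemoDelimType right,
                     DemoTokenType opening, DemoTokenType *closing)
{
    for (int i = 0; i < 2; i++) {
        if (left[i] == opening) {
            *closing = right[i];   /* same position in the other array */
            return 1;
        }
    }
    return 0;
}
```

With left set to { demoLeftQuote, demoLeftSingleQuote } and right set to { demoRightQuote, demoRightSingleQuote }, a literal opened by demoLeftSingleQuote is closed only by demoRightSingleQuote.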

Comment delimiters may be 1 or 2 tokens each and there may be two sets of opening and closing pairs. They are passed to IntlTokenize in arrays of type CommentType.

If only one token is needed for a delimiter, the second token must be specified to be delimPad. If only one delimiter of an opening-closing pair is needed, then both of the tokens allocated for the other symbol must be delimPad. The first token of a two-token sequence is at the higher position in the leftComment or rightComment array.
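The padding rules can be checked with a small helper that counts how many tokens each delimiter uses. This is a sketch only: the types and token values are placeholders invented for the example, and it assumes that elements 0 and 1 of each array belong to the first delimiter set and elements 2 and 3 to the second. It deliberately does not depend on the order of the two tokens within a delimiter.

```c
typedef short DemoTokenType;
typedef DemoTokenType DemoCommentType[4];   /* stand-in for a four-element comment array */

/* Placeholder values for illustration; the real delimPad and token codes
   are defined in Script.h. */
enum { demoPad = -1, demoMinus = 1, demoNewLine = 2 };

/* Number of tokens (0, 1, or 2) used by the delimiter in set 0 or set 1. */
int demo_delim_token_count(const DemoCommentType delims, int set)
{
    int count = 0;
    for (int i = 0; i < 2; i++) {
        if (delims[set * 2 + i] != demoPad) count++;
    }
    return count;
}

/* A set is in use when its opening or its closing delimiter has a token. */
int demo_pair_in_use(const DemoCommentType left, const DemoCommentType right,
                     int set)
{
    return demo_delim_token_count(left, set) > 0 ||
           demo_delim_token_count(right, set) > 0;
}
```

For a comment that is opened by two minus signs and needs no closing delimiter, the first set of the opening array would hold demoMinus twice and the first set of the closing array would hold demoPad twice, giving counts of 2 and 0.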

When IntlTokenize encounters an escape character within a quoted literal, it places the portion of the literal before the escape character into a single token (of type tokenLiteral), places the escape character into another token (tokenEscape), places the character following the escape character into another token (whatever token type it corresponds to), and places the portion of the literal following the escape sequence into another token (tokenLiteral). Outside of a quoted literal, the escape character has no special significance.
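The splitting of a quoted literal around an escape character can be sketched as follows. This is not the Script Manager's implementation: demo_split_literal is a hypothetical function that works on single-byte characters, the token kinds are placeholder values, and two behaviors are assumptions made for the example (empty literal portions produce no token, and an escape character at the very end of the literal is kept as literal text).

```c
#include <stddef.h>

/* Placeholder token kinds; demoOther stands for whatever token type the
   escaped character corresponds to. */
enum { demoLiteral = 1, demoEscape = 2, demoOther = 3 };

typedef struct DemoToken {
    int kind;
    const char *position;   /* start of the token in the literal body */
    long length;            /* length in bytes */
} DemoToken;

/* Appends one token; returns 0 if the output array is full. */
static int demo_emit(DemoToken *out, long maxOut, long *n,
                     int kind, const char *position, long length)
{
    if (*n >= maxOut) return 0;
    out[*n].kind = kind;
    out[*n].position = position;
    out[*n].length = length;
    (*n)++;
    return 1;
}

/* Splits the body of a quoted literal (delimiters already removed).
   Returns the number of tokens written, or -1 if `out` is too small. */
long demo_split_literal(const char *body, long len, char escapeChar,
                        DemoToken *out, long maxOut)
{
    long n = 0, start = 0, i = 0;

    while (i < len) {
        if (body[i] == escapeChar && i + 1 < len) {
            /* literal text before the escape character, if any */
            if (i > start &&
                !demo_emit(out, maxOut, &n, demoLiteral, body + start, i - start))
                return -1;
            /* the escape character itself */
            if (!demo_emit(out, maxOut, &n, demoEscape, body + i, 1)) return -1;
            /* the character that follows the escape character */
            if (!demo_emit(out, maxOut, &n, demoOther, body + i + 1, 1)) return -1;
            i += 2;
            start = i;
        } else {
            i++;
        }
    }
    /* literal text after the last escape sequence, if any */
    if (len > start &&
        !demo_emit(out, maxOut, &n, demoLiteral, body + start, len - start))
        return -1;
    return n;
}
```

For the six-character body `ab\ncd` with a backslash as the escape character, the sketch produces four tokens in order: a literal covering `ab`, the escape character, the character `n`, and a literal covering `cd`.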

IntlTokenize considers the character specified in the decimalCode parameter to be a decimal character only when it is flanked by numeric or alternate numeric characters, or when it follows them.

SPECIAL CONSIDERATIONS

IntlTokenize may move memory; your application should not call this function at interrupt time.

Because each call to IntlTokenize must be for a single script run, there can be no change of script within a comment or quoted literal.

Comments and quoted literals must be complete within a single call to IntlTokenize in order to avoid syntax errors.

IntlTokenize always uses the tokens resource whose handle you pass it in the token block structure. Therefore, it is not directly affected by the state of the font force flag or the international resources selection flag. However, if you use the GetIntlResource function to get a handle to the tokens resource to pass to IntlTokenize, remember that GetIntlResource is affected by the state of the international resources selection flag.

AVAILABILITY

Supported in Carbon. Available in Carbon 1.0.2 and later when running Mac OS 8.1 or later.


© 2000 Apple Computer, Inc. (Last Updated 6/30/2000)